generalized smoothness
Characterizing Online and Private Learnability under Distributional Constraints via Generalized Smoothness
Blanchard, Moïse, Shetty, Abhishek, Rakhlin, Alexander
Understanding minimal assumptions that enable learning and generalization is perhaps the central question of learning theory. Several celebrated results in statistical learning theory, such as the VC theorem and Littlestone's characterization of online learnability, establish conditions on the hypothesis class that allow for learning under independent data and adversarial data, respectively. Building upon recent work bridging these extremes, we study sequential decision making under distributional adversaries that can adaptively choose data-generating distributions from a fixed family $U$ and ask when such problems are learnable with sample complexity that behaves like the favorable independent case. We provide a near complete characterization of families $U$ that admit learnability in terms of a notion known as generalized smoothness i.e. a distribution family admits VC-dimension-dependent regret bounds for every finite-VC hypothesis class if and only if it is generalized smooth. Further, we give universal algorithms that achieve low regret under any generalized smooth adversary without explicit knowledge of $U$. Finally, when $U$ is known, we provide refined bounds in terms of a combinatorial parameter, the fragmentation number, that captures how many disjoint regions can carry nontrivial mass under $U$. These results provide a nearly complete understanding of learnability under distributional adversaries. In addition, building upon the surprising connection between online learning and differential privacy, we show that the generalized smoothness also characterizes private learnability under distributional constraints.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Middle East > Jordan (0.04)
Gradient-Variation Online Learning under Generalized Smoothness
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions, which is crucial for attaining fast convergence in games and robustness in stochastic o ptimization, hence receiving increased attention. Existing results often req uire the smoothness condition by imposing a fixed bound on gradient Lipschitzness, w hich may be unrealistic in practice. Recent efforts in neural network optim ization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-var iation online learning under generalized smoothness. We extend the classic optimi stic mirror descent algorithm to derive gradient-variation regret by analyzin g stability over the optimization trajectory and exploiting smoothness locally. Th en, we explore universal online learning, designing a single algorithm with the optimal gradient-va riation regrets for convex and strongly convex functions simultane ously, without requiring prior knowledge of curvature. This algorithm adopts a tw o-layer structure with a meta-algorithm running over a group of base-learners . To ensure favorable guarantees, we design a new Lipschitz-adaptive meta-a lgorithm, capable of handling potentially unbounded gradients while ensuring a second-order bound to effectively ensemble the base-learners. Finally, we provi de the applications for fast-rate convergence in games and stochastic extended adv ersarial optimization.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.68)
Convex and Non-convex Optimization Under Generalized Smoothness
Classical analysis of convex and non-convex optimization methods often requires the Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
Gradient-Variation Online Learning under Generalized Smoothness
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions, which is crucial for attaining fast convergence in games and robustness in stochastic o ptimization, hence receiving increased attention. Existing results often req uire the smoothness condition by imposing a fixed bound on gradient Lipschitzness, w hich may be unrealistic in practice. Recent efforts in neural network optim ization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-var iation online learning under generalized smoothness. We extend the classic optimi stic mirror descent algorithm to derive gradient-variation regret by analyzin g stability over the optimization trajectory and exploiting smoothness locally. Th en, we explore universal online learning, designing a single algorithm with the optimal gradient-va riation regrets for convex and strongly convex functions simultane ously, without requiring prior knowledge of curvature. This algorithm adopts a tw o-layer structure with a meta-algorithm running over a group of base-learners . To ensure favorable guarantees, we design a new Lipschitz-adaptive meta-a lgorithm, capable of handling potentially unbounded gradients while ensuring a second-order bound to effectively ensemble the base-learners. Finally, we provi de the applications for fast-rate convergence in games and stochastic extended adv ersarial optimization.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.68)
Gradient-Variation Online Learning under Generalized Smoothness
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions, which is crucial for attaining fast convergence in games and robustness in stochastic optimization, hence receiving increased attention. Existing results often require the smoothness condition by imposing a fixed bound on gradient Lipschitzness, which may be unrealistic in practice. Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-variation online learning under generalized smoothness. We extend the classic optimistic mirror descent algorithm to derive gradient-variation regret by analyzing stability over the optimization trajectory and exploiting smoothness locally.
Efficiently Escaping Saddle Points under Generalized Smoothness via Self-Bounding Regularity
Cao, Daniel Yiming, Chen, August Y., Sridharan, Karthik, Tang, Benjamin
In this paper, we study the problem of non-convex optimization on functions that are not necessarily smooth using first order methods. Smoothness (functions whose gradient and/or Hessian are Lipschitz) is not satisfied by many machine learning problems in both theory and practice, motivating a recent line of work studying the convergence of first order methods to first order stationary points under appropriate generalizations of smoothness. We develop a novel framework to study convergence of first order methods to first and \textit{second} order stationary points under generalized smoothness, under more general smoothness assumptions than the literature. Using our framework, we show appropriate variants of GD and SGD (e.g. with appropriate perturbations) can converge not just to first order but also \textit{second order stationary points} in runtime polylogarithmic in the dimension. To our knowledge, our work contains the first such result, as well as the first 'non-textbook' rate for non-convex optimization under generalized smoothness. We demonstrate that several canonical non-convex optimization problems fall under our setting and framework.
Convex and Non-convex Optimization Under Generalized Smoothness
Classical analysis of convex and non-convex optimization methods often requires the Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum
Khirirat, Sarit, Sadiev, Abdurakhmon, Riabinin, Artem, Gorbunov, Eduard, Richtárik, Peter
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. Despite their popularity and efficiency in training deep neural networks, traditional analyses of error feedback algorithms rely on the smoothness assumption that does not capture the properties of objective functions in these problems. Rather, these problems have recently been shown to satisfy generalized smoothness assumptions, and the theoretical understanding of error feedback algorithms under these assumptions remains largely unexplored. Moreover, to the best of our knowledge, all existing analyses under generalized smoothness either i) focus on single-node settings or ii) make unrealistically strong assumptions for distributed settings, such as requiring data heterogeneity, and almost surely bounded stochastic gradient noise variance. In this paper, we propose distributed error feedback algorithms that utilize normalization to achieve the $O(1/\sqrt{K})$ convergence rate for nonconvex problems under generalized smoothness. Our analyses apply for distributed settings without data heterogeneity conditions, and enable stepsize tuning that is independent of problem parameters. Additionally, we provide strong convergence guarantees of normalized error feedback algorithms for stochastic settings. Finally, we show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks, including the minimization of polynomial functions, logistic regression, and ResNet-20 training.
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > China (0.04)
- Research Report > New Finding (0.48)
- Research Report > Experimental Study (0.34)
Near-Optimal Non-Convex Stochastic Optimization under Generalized Smoothness
Liu, Zijian, Jagabathula, Srikanth, Zhou, Zhengyuan
The generalized smooth condition, $(L_{0},L_{1})$-smoothness, has triggered people's interest since it is more realistic in many optimization problems shown by both empirical and theoretical evidence. Two recent works established the $O(\epsilon^{-3})$ sample complexity to obtain an $O(\epsilon)$-stationary point. However, both require a large batch size on the order of $\mathrm{ploy}(\epsilon^{-1})$, which is not only computationally burdensome but also unsuitable for streaming applications. Additionally, these existing convergence bounds are established only for the expected rate, which is inadequate as they do not supply a useful performance guarantee on a single run. In this work, we solve the prior two problems simultaneously by revisiting a simple variant of the STORM algorithm. Specifically, under the $(L_{0},L_{1})$-smoothness and affine-type noises, we establish the first near-optimal $O(\log(1/(\delta\epsilon))\epsilon^{-3})$ high-probability sample complexity where $\delta\in(0,1)$ is the failure probability. Besides, for the same algorithm, we also recover the optimal $O(\epsilon^{-3})$ sample complexity for the expected convergence with improved dependence on the problem-dependent parameter. More importantly, our convergence results only require a constant batch size in contrast to the previous works.
- North America > United States > New York (0.04)
- Asia > China (0.04)